Logarithmic Regret for Episodic Continuous-Time Linear-Quadratic Reinforcement Learning Over a Finite-Time Horizon
Authors
Abstract
We study finite-time horizon continuous-time linear-quadratic reinforcement learning problems in an episodic setting, where both the state and control coefficients are unknown to the controller. We first propose a least-squares algorithm based on continuous-time observations and controls, and establish a logarithmic regret bound of order $O((\ln M)(\ln\ln M))$, with $M$ being the number of learning episodes. The analysis consists of two parts: a perturbation analysis, which exploits the regularity and robustness of the associated Riccati differential equation; and a parameter estimation error analysis, which relies on sub-exponential properties of continuous-time least-squares estimators. We further propose a practically implementable least-squares algorithm based on discrete-time observations and piecewise constant controls, which achieves a similar logarithmic regret with an additional term depending explicitly on the time stepsizes used in the algorithm.
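As a rough illustration of the two ingredients named in the abstract (least-squares estimation from discrete-time observations with piecewise constant controls, and the Riccati differential equation), the following is a minimal sketch and not the authors' algorithm. Everything specific here is an assumption made for the example: the dynamics $dX_t = (A X_t + B u_t)\,dt + dW_t$ with unit noise, the Euler-Maruyama simulation, the randomly drawn exploratory controls, the identity cost weights, and all function names.

```python
import numpy as np

def solve_riccati(A, B, Q, R, Q_T, T, n_steps):
    """Explicit Euler, backward in time, for
    P'(t) = -(A^T P + P A - P B R^{-1} B^T P + Q),  P(T) = Q_T.
    Returns a list whose k-th entry approximates P(k * T / n_steps)."""
    dt = T / n_steps
    R_inv = np.linalg.inv(R)
    P = np.array(Q_T, dtype=float)
    path = [P]
    for _ in range(n_steps):
        rhs = A.T @ P + P @ A - P @ B @ R_inv @ B.T @ P + Q
        P = P + dt * rhs  # step from t to t - dt
        path.append(P)
    return path[::-1]

def simulate_episode(A, B, x0, controls, T, rng):
    """Euler-Maruyama simulation with controls held constant on each step
    (unit-variance Brownian noise is an assumption of this sketch)."""
    n_steps = len(controls)
    dt = T / n_steps
    X = np.empty((n_steps + 1, len(x0)))
    X[0] = x0
    for k in range(n_steps):
        dW = rng.normal(scale=np.sqrt(dt), size=len(x0))
        X[k + 1] = X[k] + (A @ X[k] + B @ controls[k]) * dt + dW
    return X

def least_squares_estimate(states, controls, dt):
    """Regress state increments divided by dt on (state, control) pairs,
    pooling all episodes. Returns estimates (A_hat, B_hat)."""
    Z = np.vstack([np.hstack([X[:-1], U]) for X, U in zip(states, controls)])
    dX = np.vstack([np.diff(X, axis=0) for X in states])
    theta, *_ = np.linalg.lstsq(Z, dX / dt, rcond=None)
    n = states[0].shape[1]
    return theta[:n].T, theta[n:].T

# Hypothetical two-dimensional system, used only for illustration.
rng = np.random.default_rng(0)
A_true = np.array([[0.0, 1.0], [-1.0, -0.5]])
B_true = np.array([[0.0], [1.0]])
T, n_steps, n_episodes = 1.0, 100, 400

states, controls = [], []
for _ in range(n_episodes):
    U = rng.normal(size=(n_steps, 1))  # random controls: an assumption
    X = simulate_episode(A_true, B_true, np.array([1.0, 0.0]), U, T, rng)
    states.append(X)
    controls.append(U)

A_hat, B_hat = least_squares_estimate(states, controls, T / n_steps)

# Plug the estimates into the Riccati equation to get a feedback gain at t = 0,
# i.e. u = K0 @ x with K0 = -R^{-1} B_hat^T P(0), for identity Q and R.
P_path = solve_riccati(A_hat, B_hat, np.eye(2), np.eye(1), np.zeros((2, 2)), T, 1000)
K0 = -np.linalg.inv(np.eye(1)) @ B_hat.T @ P_path[0]
```

In an episodic scheme, the controls of each new episode would typically come from the feedback gain computed with the current estimates rather than from random draws; the random controls above only keep the estimation step easy to verify in isolation.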
Similar resources
Two-warehouse system for non-instantaneous deterioration products with promotional effort and inflation over a finite time horizon
In the current global market, organizations use many promotional tools to increase their sales. One such tool is sales teams’ initiatives or promotional policies, i.e., free gifts, discounts, packaging, etc. This phenomenon motivates the retailer/or buyer to order a large inventory lot so as to take full benefit of promotional policies. In view of this the present paper considers a two-warehous...
Iteratively Extending Time Horizon Reinforcement Learning
Reinforcement learning aims to determine an (infinite time horizon) optimal control policy from interaction with a system. It can be solved by approximating the so-called Q-function from a sample of four-tuples $(x_t, u_t, r_t, x_{t+1})$ where $x_t$ denotes the system state at time $t$, $u_t$ the control action taken, $r_t$ the instantaneous reward obtained and $x_{t+1}$ the successor state of the system, and by deter...
Inventory Model for Non-Instantaneous Deteriorating Items, Stock Dependent Demand, Partial Backlogging, and Inflation over a Finite Time Horizon
In the present study, the Economic Order Quantity (EOQ) model of two-warehouse deals with non-instantaneous deteriorating items, the demand rate considered as stock dependent and model affected by inflation under the pattern of time value of money over a finite planning horizon. Shortages are allowed and partially backordered depending on the waiting time for the next replenishment. The main ob...
Continuous-Time Hierarchical Reinforcement Learning
Hierarchical reinforcement learning (RL) is a general framework which studies how to exploit the structure of actions and tasks to accelerate policy learning in large domains. Prior work in hierarchical RL, such as the MAXQ method, has been limited to the discrete-time discounted reward semi-Markov decision process (SMDP) model. This paper generalizes the MAXQ method to continuous-time discounte...
Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm’s online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic on...
Journal
Journal title: Social Science Research Network
Year: 2021
ISSN: 1556-5068
DOI: https://doi.org/10.2139/ssrn.3848428